In this project, I implement several regression models (multiple linear regression, random forest, lasso, neural network, etc.) to explore the relationship between happiness scores and other factors, using data from the United Nations Sustainable Development Solutions Network (UNSDSN). I also aim to find a suitable model for predicting the happiness score. The underlying data come from the Gallup World Poll. The dataset gives the happiness rank and happiness score of 155 countries from 2015 to 2017, based on seven factors: family, life expectancy, economy, generosity, trust in government, freedom, and the dystopia residual.
library(data.table)
library(tidyverse)
library(corrplot)
library(plotly)
library(wildcard)
library(PerformanceAnalytics)
library(DT)
library(maps)
library(caTools)
library(ggpubr)
library(leaps)
library(boot)
library(glmnet)
library(reshape)
library(car)
library(mgcv)
library(gamclass)
library(e1071)
library(randomForest)
library(neuralnet)
library(BSDA)
library(knitr)
wh15 <- fread("~/Downloads/GWU/6214_Barut/Final/world-happiness-report/2015.csv",data.table = FALSE)
wh16 <- fread("~/Downloads/GWU/6214_Barut/Final/world-happiness-report/2016.csv",data.table = FALSE)
wh17 <- fread("~/Downloads/GWU/6214_Barut/Final/world-happiness-report/2017.csv",data.table = FALSE)
wh15$year <- 2015
wh16$year <- 2016
wh17$year <- 2017
names(wh17)[2] <- "Happiness Rank"
names(wh17)[3] <- "Happiness Score"
names(wh17)[6] <- "Economy (GDP per Capita)"
names(wh17)[8] <- "Health (Life Expectancy)"
names(wh17)[11] <- "Trust (Government Corruption)"
names(wh17)[12] <- "Dystopia Residual"
wh15_17 <- bind_rows(wh15,wh16,wh17)
names(wh15_17)[5] <- "SD_error"
wh15_17 <- wh15_17 %>% select(Country:year,-Region, -SD_error)
names(wh15_17) <- c("Country","Happiness_Rank","Happiness_Score","Economy",
"Family","Health","Freedom","Trust",
                    "Generosity","Dystopia_Residual","year")
There are 12 variables in the original dataset.
text_tbl <- data.frame(
Variables=c("Country","Region","Happiness Rank","Happiness Score","Standard Error","Economy (GDP per Capita)","Family","Health (Life Expectancy)","Freedom","Trust (Government Corruption)","Generosity","Dystopia Residual"),
Description=c("Name of the country.","Region the country belongs to.","Rank of the country based on the Happiness Score.","A metric measured in 2015 by asking the sampled people the question:How would you rate your happiness on a scale of 0 to 10 where 10 is the happiest.","The standard error of the happiness score.","The extent to which GDP contributes to the calculation of the Happiness Score.","The extent to which Family contributes to the calculation of the Happiness Score.","The extent to which Life expectancy contributed to the calculation of the Happiness Score","The extent to which Freedom contributed to the calculation of the Happiness Score.","The extent to which Perception of Corruption contributes to Happiness Score.","The extent to which Generosity contributed to the calculation of the Happiness Score.","The extent to which Dystopia Residual contributed to the calculation of the Happiness Score."))
kable(text_tbl)
| Variables | Description |
|---|---|
| Country | Name of the country. |
| Region | Region the country belongs to. |
| Happiness Rank | Rank of the country based on the Happiness Score. |
| Happiness Score | A metric measured in 2015 by asking the sampled people the question:How would you rate your happiness on a scale of 0 to 10 where 10 is the happiest. |
| Standard Error | The standard error of the happiness score. |
| Economy (GDP per Capita) | The extent to which GDP contributes to the calculation of the Happiness Score. |
| Family | The extent to which Family contributes to the calculation of the Happiness Score. |
| Health (Life Expectancy) | The extent to which Life expectancy contributed to the calculation of the Happiness Score |
| Freedom | The extent to which Freedom contributed to the calculation of the Happiness Score. |
| Trust (Government Corruption) | The extent to which Perception of Corruption contributes to Happiness Score. |
| Generosity | The extent to which Generosity contributed to the calculation of the Happiness Score. |
| Dystopia Residual | The extent to which Dystopia Residual contributed to the calculation of the Happiness Score. |
The seven factors that influence the Happiness Score are Economy, Family, Health, Freedom, Trust, Generosity, and Dystopia Residual. The sum of these seven factor values gives the happiness score, and the higher the happiness score, the better (lower) the happiness rank. A higher value for any of these factors therefore corresponds to a higher level of happiness, so each factor can be read as the extent to which it contributes to happiness. Dystopia, the opposite of utopia, is a hypothetical country with the lowest happiness level; it serves as a benchmark showing how far each country is from the least happy possible country. After understanding the dataset, I import, arrange, and clean the data in RStudio. The next section shows some features of this dataset.
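As a quick sketch of this additive structure (using illustrative made-up factor values, not actual survey data), a country's score is simply the sum of its seven factor contributions:

```r
# Hypothetical factor values for one country (made up for illustration)
factors <- c(Economy = 1.40, Family = 1.35, Health = 0.94, Freedom = 0.67,
             Trust = 0.42, Generosity = 0.30, Dystopia_Residual = 2.52)

# The happiness score is the sum of the seven factor contributions
happiness_score <- sum(factors)
happiness_score  # 7.6
```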
head(wh15_17)
names(wh15_17)
## [1] "Country" "Happiness_Rank" "Happiness_Score"
## [4] "Economy" "Family" "Health"
## [7] "Freedom" "Trust" "Generosity"
## [10] "Dystopia_Residual" "year"
str(wh15_17)
## 'data.frame': 470 obs. of 11 variables:
## $ Country : chr "Switzerland" "Iceland" "Denmark" "Norway" ...
## $ Happiness_Rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Happiness_Score : num 7.59 7.56 7.53 7.52 7.43 ...
## $ Economy : num 1.4 1.3 1.33 1.46 1.33 ...
## $ Family : num 1.35 1.4 1.36 1.33 1.32 ...
## $ Health : num 0.941 0.948 0.875 0.885 0.906 ...
## $ Freedom : num 0.666 0.629 0.649 0.67 0.633 ...
## $ Trust : num 0.42 0.141 0.484 0.365 0.33 ...
## $ Generosity : num 0.297 0.436 0.341 0.347 0.458 ...
## $ Dystopia_Residual: num 2.52 2.7 2.49 2.47 2.45 ...
## $ year : num 2015 2015 2015 2015 2015 ...
summary(wh15_17)
## Country Happiness_Rank Happiness_Score Economy
## Length:470 Min. : 1.00 Min. :2.693 Min. :0.0000
## Class :character 1st Qu.: 40.00 1st Qu.:4.509 1st Qu.:0.6053
## Mode :character Median : 79.00 Median :5.282 Median :0.9954
## Mean : 78.83 Mean :5.371 Mean :0.9278
## 3rd Qu.:118.00 3rd Qu.:6.234 3rd Qu.:1.2524
## Max. :158.00 Max. :7.587 Max. :1.8708
## Family Health Freedom Trust
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.7930 1st Qu.:0.4023 1st Qu.:0.2976 1st Qu.:0.05978
## Median :1.0257 Median :0.6301 Median :0.4183 Median :0.09950
## Mean :0.9903 Mean :0.5800 Mean :0.4028 Mean :0.13479
## 3rd Qu.:1.2287 3rd Qu.:0.7683 3rd Qu.:0.5169 3rd Qu.:0.17316
## Max. :1.6106 Max. :1.0252 Max. :0.6697 Max. :0.55191
## Generosity Dystopia_Residual year
## Min. :0.0000 Min. :0.3286 Min. :2015
## 1st Qu.:0.1528 1st Qu.:1.7380 1st Qu.:2015
## Median :0.2231 Median :2.0946 Median :2016
## Mean :0.2422 Mean :2.0927 Mean :2016
## 3rd Qu.:0.3158 3rd Qu.:2.4556 3rd Qu.:2017
## Max. :0.8381 Max. :3.8377 Max. :2017
Next, I delete countries that were not listed in all three years. Note: some countries underwent a name change, such as Hong Kong S.A.R., China becoming Hong Kong in 2017 for political reasons, while other countries, as admitted by the UNSDSN, did not complete the survey or administer it to their citizens. To minimize these issues, the committee used the 2014 data for those countries. This of course affects the accuracy of the data, as noted in the discussion of source validity.
# Count the number of years each country appears in
cwithout3years <- wh15_17 %>% group_by(Country) %>% mutate(count = n())
cwithout3years %>% filter(count != 3) %>% select(Country, Happiness_Rank, year) %>% arrange(Country)
corrplot(cor(wh15_17 %>%
select(Happiness_Score:Dystopia_Residual)), method="color",
sig.level = 0.01, insig = "blank",addCoef.col = "black", tl.srt=45,
         type="upper")
According to the correlation plot above, there is clearly a positive correlation between Happiness Score and every other numerical variable: the higher the happiness score, the higher the seven contributing factors. In particular, Economy and Health play the most significant roles in contributing to happiness, while Generosity has the lowest impact on the happiness score.
Since Economy and Health play the most significant roles in contributing to happiness, I draw a scatter plot of the happiness score against Economy, colored by Health.
plot_ly(data = wh15_17, x=~Economy, y=~Happiness_Score, color=~Health,
type = "scatter",text = ~paste("Country:", Country)) %>%
layout(title = "Relationship between Happiness, GDP and Health ",
xaxis = list(title = "GDP "),
        yaxis = list(title = "Happiness Score"))
hist(wh15_17$Happiness_Score , xlab = "World Happiness Score", main = "World Happiness Score during three years")
The World Happiness Score appears approximately normally distributed, with scores between 5 and 6 occurring most frequently.
world <- map_data('world')
world <- world %>% filter(region != "Antarctica")
world <- fortify(world)
#2015 World Happiness Score Map
happiness.score15 <- wh15_17 %>% select(Country, Happiness_Score, year) %>% filter(year == 2015)
happiness.score15 <- wildcard(df = happiness.score15, wildcard = "United States", values = "USA", expand = TRUE, rules = NULL)
happiness.score15 <- wildcard(df = happiness.score15, wildcard = "United Kingdom", values = "UK", expand = TRUE, rules = NULL)
happiness.score15 <- wildcard(df = happiness.score15, wildcard = "Democratic Republic of the Congo", values = "Congo (Kinshasa)",expand = TRUE, rules = NULL)
ggplot() +
geom_map(data=world, map=world,
aes(x=long, y=lat, group=group, map_id=region),
fill="white", colour="black") +
geom_map(data=happiness.score15, map=world,
aes(fill=Happiness_Score, map_id=Country),colour="black") +
scale_fill_continuous(low="red", high="pink",guide="colorbar") +
  labs(title = "2015 World Happiness Score Map")
#2016 World Happiness Score Map
happiness.score16 <- wh15_17 %>% select(Country, Happiness_Score, year) %>% filter(year == 2016)
happiness.score16 <- wildcard(df = happiness.score16, wildcard = "United States", values = "USA", expand = TRUE, rules = NULL)
happiness.score16 <- wildcard(df = happiness.score16, wildcard = "United Kingdom", values = "UK", expand = TRUE, rules = NULL)
happiness.score16 <- wildcard(df = happiness.score16, wildcard = "Democratic Republic of the Congo", values = "Congo (Kinshasa)",expand = TRUE, rules = NULL)
ggplot() +
geom_map(data=world, map=world,
aes(x=long, y=lat, group=group, map_id=region),
fill="white", colour="black") +
geom_map(data=happiness.score16, map=world,
aes(fill=Happiness_Score, map_id=Country),colour="black") +
scale_fill_continuous(low="red", high="pink",guide="colorbar") +
  labs(title = "2016 World Happiness Score Map")
#2017 World Happiness Score Map
happiness.score17 <- wh15_17 %>% select(Country, Happiness_Score, year) %>% filter(year == 2017)
happiness.score17 <- wildcard(df = happiness.score17, wildcard = "United States", values = "USA", expand = TRUE, rules = NULL)
happiness.score17 <- wildcard(df = happiness.score17, wildcard = "United Kingdom", values = "UK", expand = TRUE, rules = NULL)
happiness.score17 <- wildcard(df = happiness.score17, wildcard = "Democratic Republic of the Congo", values = "Congo (Kinshasa)",expand = TRUE, rules = NULL)
ggplot() +
geom_map(data=world, map=world,
aes(x=long, y=lat, group=group, map_id=region),
fill="white", colour="black") +
geom_map(data=happiness.score17, map=world,
aes(fill=Happiness_Score, map_id=Country),colour="black") +
scale_fill_continuous(low="red", high="pink",guide="colorbar") +
  labs(title = "2017 World Happiness Score Map")
Comparing the three maps above, most of North America, South America, and Australia show higher scores, while most African countries display lower scores. The differences across the three years also show that the scores of several African and South American countries decline over time, possibly because of war, economic conditions, or other factors.
world.happiness17 <- wh15_17 %>% filter(year == 2017)
top5 <- world.happiness17 %>% head(5) %>% mutate(Level = "TOP5")
middle5 <- world.happiness17[76:80, ] %>% mutate(Level = "MIDDLE5")
worst5 <- world.happiness17 %>% tail(5) %>% mutate(Level = "WORST5")
comparison <- bind_rows(top5, middle5, worst5)
comparison$Level <- as.factor(comparison$Level)
comparison <- transform(comparison, Level = factor(Level, levels = c("TOP5", "MIDDLE5", "WORST5" )))
datatable(comparison,
options = list(
lengthMenu = c(5, 10, 15)),
caption = htmltools::tags$caption(style = 'caption-side: bottom; text-align: center;',
htmltools::em('Data table that only includes top5, middle5 and worst5 countries'))
)
We keep looking deeper into this dataset to produce a more accurate report. In the sections below, I regress the happiness score on the seven factors.
p1 <- ggplot( data=wh15_17,aes(x = Economy, y = Happiness_Score)) +
geom_point(size = 0.5, alpha = 0.8,color="green") +
geom_smooth(method = "lm", fullrange = TRUE)
p2 <- ggplot( data=wh15_17,aes(x = Family, y = Happiness_Score)) +
geom_point(size = 0.5, alpha = 0.8,color="green") +
geom_smooth(method = "lm", fullrange = TRUE)
p3 <- ggplot( data=wh15_17,aes(x = Health, y = Happiness_Score)) +
geom_point(size = 0.5, alpha = 0.8,color="green") +
geom_smooth(method = "lm", fullrange = TRUE)
p4 <- ggplot( data=wh15_17,aes(x = Freedom, y = Happiness_Score)) +
geom_point(size = 0.5, alpha = 0.8,color="green") +
geom_smooth(method = "lm", fullrange = TRUE)
p5 <- ggplot( data=wh15_17,aes(x = Trust, y = Happiness_Score)) +
geom_point(size = 0.5, alpha = 0.8,color="green") +
geom_smooth(method = "lm", fullrange = TRUE)
p6 <- ggplot( data=wh15_17,aes(x = Generosity, y = Happiness_Score)) +
geom_point(size = 0.5, alpha = 0.8,color="green") +
geom_smooth(method = "lm", fullrange = TRUE)
p7 <- ggplot( data=wh15_17,aes(x = Dystopia_Residual, y = Happiness_Score)) +
geom_point(size = 0.5, alpha = 0.8,color="green") +
geom_smooth(method = "lm", fullrange = TRUE)
ggarrange(p1, p2, p3, p4, p5, p6, p7, ncol=3,nrow = 3)
Before fitting any regression, I first look at scatter plots of the happiness score (dependent variable) against the seven factors (independent variables). The plots show that while most factors correlate with the happiness score in some way, some correlate much more strongly than others. Economy and Health show strong correlations with the smallest variance, while Freedom and Family show strong correlations but with very high variance. Interestingly, Generosity shows no clear trend: whether happiness is low or high, there is a large variance in generosity.
I split the dataset into training and test sets using an 80/20 ratio.
set.seed(666)
wh_reg <- wh15_17[3:10]
split = sample.split(wh15_17$Happiness_Score, SplitRatio = 0.8)
training_set = subset(wh_reg, split == TRUE)
test_set = subset(wh_reg, split == FALSE)
reg_fwd <- regsubsets(Happiness_Score ~ .,data = training_set,nvmax=13,method="forward")
summary(reg_fwd)
## Subset selection object
## Call: regsubsets.formula(Happiness_Score ~ ., data = training_set,
## nvmax = 13, method = "forward")
## 7 Variables (and intercept)
## Forced in Forced out
## Economy FALSE FALSE
## Family FALSE FALSE
## Health FALSE FALSE
## Freedom FALSE FALSE
## Trust FALSE FALSE
## Generosity FALSE FALSE
## Dystopia_Residual FALSE FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: forward
## Economy Family Health Freedom Trust Generosity Dystopia_Residual
## 1 ( 1 ) "*" " " " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " " " " " "*"
## 3 ( 1 ) "*" " " " " "*" " " " " "*"
## 4 ( 1 ) "*" "*" " " "*" " " " " "*"
## 5 ( 1 ) "*" "*" "*" "*" " " " " "*"
## 6 ( 1 ) "*" "*" "*" "*" " " "*" "*"
## 7 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
fwd_summary <- summary(reg_fwd)
plot(fwd_summary$bic)
The summary lists the best model of each size. For instance, the best model of size 1 includes only Economy, and the best model of size 2 includes Economy and Dystopia_Residual. In the plot, the lowest BIC is achieved by the 7th model, which includes all of the variables. I will also check the models using cross-validation.
CVmse <- rep(0,7)
for(i in 1:7){
  # Predictor columns chosen by the size-i model; shift by 1 because
  # Happiness_Score is column 1 of training_set
  tempCols <- which(fwd_summary$which[i,-1]) + 1
  # Prepend the response column
  tempCols <- c(1, tempCols)
  tempGLM <- glm(Happiness_Score~.,data=training_set[,tempCols])
  tempCV <- cv.glm(training_set[,tempCols],tempGLM,K = 10)
  CVmse[i] <- tempCV$delta[1]
}
plot(CVmse)
The model with the lowest cross-validation error is again the 7th model, which includes all of the variables.
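For reference, the mechanics behind cv.glm can be sketched in base R. The snippet below runs a manual 10-fold cross-validation on synthetic data (the data, model, and fold assignment are made up for illustration, not taken from the happiness dataset):

```r
set.seed(1)
# Synthetic data: y is a noisy linear function of x
n <- 200
x <- runif(n)
y <- 2 * x + rnorm(n, sd = 0.1)
d <- data.frame(x = x, y = y)

K <- 10
folds <- sample(rep(1:K, length.out = n))  # random fold labels 1..10
mse <- numeric(K)
for (k in 1:K) {
  fit  <- lm(y ~ x, data = d[folds != k, ])        # train on the other folds
  pred <- predict(fit, newdata = d[folds == k, ])  # predict the held-out fold
  mse[k] <- mean((d$y[folds == k] - pred)^2)
}
mean(mse)  # cross-validated estimate of the test MSE
```

cv.glm reports essentially this quantity as the first component of delta, alongside a bias-corrected version.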
reg_bwd <- regsubsets(Happiness_Score ~ .,data = training_set,nvmax=13,method="backward")
summary(reg_bwd)
## Subset selection object
## Call: regsubsets.formula(Happiness_Score ~ ., data = training_set,
## nvmax = 13, method = "backward")
## 7 Variables (and intercept)
## Forced in Forced out
## Economy FALSE FALSE
## Family FALSE FALSE
## Health FALSE FALSE
## Freedom FALSE FALSE
## Trust FALSE FALSE
## Generosity FALSE FALSE
## Dystopia_Residual FALSE FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: backward
## Economy Family Health Freedom Trust Generosity Dystopia_Residual
## 1 ( 1 ) "*" " " " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " " " " " "*"
## 3 ( 1 ) "*" " " " " "*" " " " " "*"
## 4 ( 1 ) "*" "*" " " "*" " " " " "*"
## 5 ( 1 ) "*" "*" "*" "*" " " " " "*"
## 6 ( 1 ) "*" "*" "*" "*" " " "*" "*"
## 7 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
bwd_summary <- summary(reg_bwd)
plot(bwd_summary$bic)
Backward stepwise selection recovers the same sequence of models as forward stepwise selection, which agrees with the multiple linear regression results.
lasso.cv <- cv.glmnet(x=as.matrix(training_set[,-1]),y=as.matrix(training_set[,1]),alpha=1,nfolds = 10)
plot(lasso.cv)
The lowest MSE is achieved at small values of the penalty parameter lambda, around log(lambda) = -4, where the MSE is roughly 0.3. This is a model with 7 coefficients, which is the same model obtained above: the one including all variables.
At this point, we should be convinced that the best model uses all the variables. Next, we check whether any nonlinear transformations are necessary.
### 4.4.1 Multiple Linear Regression
First, I fit a multiple linear regression to the training set.
reg_lm = lm(Happiness_Score ~ .,data = training_set)
summary(reg_lm)
##
## Call:
## lm(formula = Happiness_Score ~ ., data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.659e-04 -2.414e-04 -2.620e-06 2.574e-04 5.007e-04
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.876e-05 8.439e-05 0.933 0.351
## Economy 1.000e+00 6.606e-05 15139.332 <2e-16 ***
## Family 1.000e+00 6.362e-05 15718.154 <2e-16 ***
## Health 9.999e-01 1.020e-04 9803.124 <2e-16 ***
## Freedom 1.000e+00 1.323e-04 7555.868 <2e-16 ***
## Trust 9.999e-01 1.616e-04 6186.889 <2e-16 ***
## Generosity 1.000e+00 1.258e-04 7951.857 <2e-16 ***
## Dystopia_Residual 1.000e+00 2.676e-05 37368.142 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0002895 on 368 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 8.328e+08 on 7 and 368 DF, p-value: < 2.2e-16
The summary shows that all independent variables have a significant effect, and the adjusted R-squared is 1! As discussed, the happiness score is by construction the sum of the seven factors, which is exactly a linear relationship between the dependent and independent variables; this is the justification for an adjusted R-squared of 1. As a result, I expect multiple linear regression to predict happiness scores with essentially 100% accuracy. Now, check the prediction results on the test set.
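The same phenomenon can be reproduced on synthetic data: when the response is constructed as an exact sum of the predictors, lm recovers slopes of 1 and a perfect fit (made-up data, not the happiness dataset):

```r
set.seed(42)
# Made-up predictors; the response is their exact, noiseless sum
d <- data.frame(a = runif(50), b = runif(50), c = runif(50))
d$y <- d$a + d$b + d$c

fit <- lm(y ~ a + b + c, data = d)
round(coef(fit), 6)     # intercept ~ 0, every slope ~ 1
summary(fit)$r.squared  # 1 (up to floating point)
```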
pred_lm = predict(reg_lm, newdata = test_set)
Actual_lm <- as.data.frame(cbind(Prediction = pred_lm, Actual = test_set$Happiness_Score))
gg.lm <- ggplot(Actual_lm, aes(Actual, Prediction )) +
geom_point() + theme_bw() + geom_abline() +
labs(title = "Multiple Linear Regression", x = "Actual happiness score",
y = "Predicted happiness score") +
theme(plot.title = element_text(family = "Helvetica", face = "bold", size = (15)),
axis.title = element_text(family = "Helvetica", size = (10)))
gg.lm
As assumed, the plot shows that multiple linear regression works well. Next, I plot the residuals against each covariate.
train.melt <- melt(training_set)
train_melt<- cbind(train.melt,resid=reg_lm$residuals)
ggplot(train_melt,aes(x=value,y=resid))+geom_point()+geom_smooth(method="loess")+facet_wrap(~variable,scales="free")
There are clear non-linearities in the Economy, Trust, and Dystopia_Residual variables. I will work through them in that order, trying polynomial terms of different degrees for each variable and choosing the degree by cross-validation.
# Economy
EcoMSE <- rep(0,10)
for(i in 1:10){
templm <- glm(Happiness_Score~.-Economy+poly(Economy,i),data=training_set)
tempCV <- cv.glm(training_set,templm,K = 10)
EcoMSE[i] <- tempCV$delta[1]
}
plot(EcoMSE)
The minimum is obtained with a third-degree polynomial for Economy.
# Trust
TruMSE <- rep(0,10)
for(i in 1:10){
templm2 <- glm(Happiness_Score~.-Economy-Trust+poly(Economy,3)+poly(Trust,i),
data=training_set)
tempCV2 <- cv.glm(training_set,templm2,K = 10)
TruMSE[i] <- tempCV2$delta[1]
}
plot(TruMSE)
The minimum is obtained with a third-degree polynomial for Trust.
# Dystopia_Residual
DysMSE <- rep(0,10)
for(i in 1:10){
templm3 <- glm(Happiness_Score~.-Economy-Trust-Dystopia_Residual
+poly(Economy,3)+poly(Trust,3)+poly(Dystopia_Residual,i),data=training_set)
tempCV3 <- cv.glm(training_set,templm3,K = 10)
DysMSE[i] <- tempCV3$delta[1]
}
plot(DysMSE)
The minimum is obtained with an eighth-degree polynomial for Dystopia_Residual.
After transforming the non-linear independent variables, I obtain the final model.
reg_poly <- lm(Happiness_Score~
poly(Economy,3)+Family+Health+Freedom+poly(Trust,3)+Generosity+poly(Dystopia_Residual,8), data = training_set)
summary(reg_poly)
##
## Call:
## lm(formula = Happiness_Score ~ poly(Economy, 3) + Family + Health +
## Freedom + poly(Trust, 3) + Generosity + poly(Dystopia_Residual,
## 8), data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.655e-04 -2.315e-04 5.620e-06 2.329e-04 5.611e-04
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.172e+00 8.693e-05 36483.276 <2e-16 ***
## poly(Economy, 3)1 8.053e+00 5.514e-04 14605.597 <2e-16 ***
## poly(Economy, 3)2 4.643e-04 3.542e-04 1.311 0.191
## poly(Economy, 3)3 4.474e-04 3.206e-04 1.396 0.164
## Family 1.000e+00 6.408e-05 15605.489 <2e-16 ***
## Health 1.000e+00 1.068e-04 9361.388 <2e-16 ***
## Freedom 1.000e+00 1.338e-04 7471.399 <2e-16 ***
## poly(Trust, 3)1 2.141e+00 4.129e-04 5184.979 <2e-16 ***
## poly(Trust, 3)2 3.649e-04 3.042e-04 1.199 0.231
## poly(Trust, 3)3 -2.479e-04 3.275e-04 -0.757 0.450
## Generosity 1.000e+00 1.310e-04 7631.350 <2e-16 ***
## poly(Dystopia_Residual, 8)1 1.117e+01 3.049e-04 36645.386 <2e-16 ***
## poly(Dystopia_Residual, 8)2 -7.707e-05 2.986e-04 -0.258 0.796
## poly(Dystopia_Residual, 8)3 3.254e-04 2.939e-04 1.107 0.269
## poly(Dystopia_Residual, 8)4 -1.994e-04 2.954e-04 -0.675 0.500
## poly(Dystopia_Residual, 8)5 -1.707e-04 2.981e-04 -0.573 0.567
## poly(Dystopia_Residual, 8)6 -5.434e-04 2.914e-04 -1.865 0.063 .
## poly(Dystopia_Residual, 8)7 3.854e-04 2.928e-04 1.316 0.189
## poly(Dystopia_Residual, 8)8 -2.544e-04 2.920e-04 -0.871 0.384
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0002881 on 357 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 3.27e+08 on 18 and 357 DF, p-value: < 2.2e-16
polyGLM <- glm(Happiness_Score~
poly(Economy,3)+Family+Health+Freedom+poly(Trust,3)+Generosity+poly(Dystopia_Residual,8),data = training_set)
cv.glm(training_set,polyGLM,K=10)$delta[1]
## [1] 8.858246e-08
The cross-validated MSE of the polynomial regression, about 8.86e-08, is clearly far smaller than that of the linear model we started with.
reg_lminter = lm(Happiness_Score ~ .^2,data = wh_reg)
summary(reg_lminter)
##
## Call:
## lm(formula = Happiness_Score ~ .^2, data = wh_reg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.821e-04 -2.020e-04 3.970e-06 2.358e-04 5.992e-04
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.562e-04 2.926e-04 1.559 0.11979
## Economy 9.993e-01 3.500e-04 2855.111 < 2e-16 ***
## Family 1.000e+00 3.130e-04 3195.636 < 2e-16 ***
## Health 1.000e+00 5.787e-04 1728.575 < 2e-16 ***
## Freedom 9.989e-01 7.218e-04 1383.864 < 2e-16 ***
## Trust 1.001e+00 8.479e-04 1180.386 < 2e-16 ***
## Generosity 1.000e+00 5.848e-04 1710.399 < 2e-16 ***
## Dystopia_Residual 9.999e-01 1.099e-04 9100.127 < 2e-16 ***
## Economy:Family -7.129e-05 2.125e-04 -0.335 0.73742
## Economy:Health -2.791e-04 2.180e-04 -1.280 0.20108
## Economy:Freedom 1.744e-03 5.481e-04 3.182 0.00156 **
## Economy:Trust -1.804e-04 5.796e-04 -0.311 0.75578
## Economy:Generosity 7.849e-04 5.310e-04 1.478 0.14010
## Economy:Dystopia_Residual 5.791e-05 1.111e-04 0.521 0.60237
## Family:Health 5.473e-04 3.972e-04 1.378 0.16901
## Family:Freedom -2.677e-04 4.219e-04 -0.635 0.52605
## Family:Trust -2.666e-04 6.159e-04 -0.433 0.66541
## Family:Generosity -4.028e-04 5.281e-04 -0.763 0.44606
## Family:Dystopia_Residual -5.424e-05 1.104e-04 -0.491 0.62351
## Health:Freedom -1.601e-03 7.389e-04 -2.166 0.03082 *
## Health:Trust 1.140e-03 1.111e-03 1.026 0.30548
## Health:Generosity -1.535e-03 1.002e-03 -1.532 0.12618
## Health:Dystopia_Residual 3.532e-05 1.774e-04 0.199 0.84226
## Freedom:Trust -3.664e-04 1.294e-03 -0.283 0.77717
## Freedom:Generosity 1.165e-03 1.107e-03 1.053 0.29290
## Freedom:Dystopia_Residual 2.061e-04 2.107e-04 0.978 0.32863
## Trust:Generosity -8.158e-04 1.249e-03 -0.653 0.51394
## Trust:Dystopia_Residual -5.501e-04 2.432e-04 -2.261 0.02422 *
## Generosity:Dystopia_Residual 7.007e-05 2.052e-04 0.341 0.73290
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0002836 on 441 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.692e+08 on 28 and 441 DF, p-value: < 2.2e-16
From the summary above, almost all interaction terms are insignificant at the 0.05 level, and the few that are nominally significant (e.g. Economy:Freedom) have negligible coefficients compared with the main effects, so I conclude there are no meaningful interaction effects. Having selected the final model (the polynomial regression), I now check several diagnostics.
Checking whether the residuals follow a Gaussian distribution.
wh15new=filter(wh15_17,year==2015)
reg_15 <- lm(Happiness_Score ~poly(Economy,3)+Family+Health+Freedom+poly(Trust,3)+Generosity+poly(Dystopia_Residual,8),
data = wh15new)
shapiro.test(reg_15$residuals)
##
## Shapiro-Wilk normality test
##
## data: reg_15$residuals
## W = 0.9835, p-value = 0.05664
According to the Shapiro-Wilk test, the p-value is higher than the 0.05 significance level, so we cannot reject the null hypothesis that the residuals are normally distributed; the Gaussian assumption on the residuals appears to hold.
Checking whether the residuals are homoskedastic, that is, whether they have constant variance.
plot(reg_poly,which=1)
ncvTest(reg_poly)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.04607611, Df = 1, p = 0.83004
The p-value of the ncvTest is higher than the 0.05 significance level, so we cannot reject the null hypothesis that the residual variance is constant; the homoskedasticity assumption is satisfied.
Confidence interval from the normal approximation
poly_pred <- predict(reg_poly,training_set,interval = "confidence")
z.test(poly_pred,sigma.x = 6.8)
##
## One-sample z-Test
##
## data: poly_pred
## z = 26.557, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 4.980159 5.773815
## sample estimates:
## mean of x
## 5.376987
Confidence interval from Bootstrap
sim<-data.frame(MeanRep=replicate(1e5,mean(sample(training_set$Happiness_Score,replace = T))))
point<-round(mean(sim$MeanRep))
conf<-data.frame(Val=round(quantile(sim$MeanRep,c(.025,1-.025)),digits = 2))
ggplot(sim,aes(x=MeanRep))+
geom_density(fill="steelblue3",alpha=.33,size=1.4)+
scale_x_continuous(breaks = seq(0,40,2))+
geom_vline(size=1.4,linetype=2,aes(xintercept=mean(MeanRep)))+
geom_vline(size=1.4,linetype=2,color="red",aes(xintercept=quantile(sim$MeanRep,.025)))+
geom_vline(size=1.4,linetype=2,color="red",aes(xintercept=quantile(sim$MeanRep,1-.025)))+
theme_bw()+theme(axis.text = element_text(color="black"))+
  labs(x="Mean Happiness Score",y="Density",subtitle=paste(" Point Estimate:",point,"\n","95% Conf. Interval:",conf[1,],"-",conf[2,]))
At 95% confidence, the normal approximation gives a point estimate of 5.38 for the mean happiness score, with a confidence interval from 4.98 to 5.77, while the bootstrap interval runs from 5.26 to 5.49 around a point estimate of 5. Thus the normal approximation did a better job here than the bootstrap.
In this part, I fit an additive model, using splines for the polynomial terms.
gam.happy <- gam(Happiness_Score~
s(Economy)+Family+Health+Freedom+s(Trust)+Generosity+s(Dystopia_Residual),
data = training_set)
summary(gam.happy)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Happiness_Score ~ s(Economy) + Family + Health + Freedom + s(Trust) +
## Generosity + s(Dystopia_Residual)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.172e+00 8.486e-05 37373 <2e-16 ***
## Family 1.000e+00 6.348e-05 15752 <2e-16 ***
## Health 9.999e-01 1.023e-04 9777 <2e-16 ***
## Freedom 1.000e+00 1.320e-04 7574 <2e-16 ***
## Generosity 1.000e+00 1.263e-04 7920 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Economy) 1.505 1.871 1.224e+08 <2e-16 ***
## s(Trust) 1.493 1.836 1.881e+07 <2e-16 ***
## s(Dystopia_Residual) 1.000 1.000 1.390e+09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 1 Deviance explained = 100%
## GCV = 8.5368e-08 Scale est. = 8.3326e-08 n = 376
CVgam(formula(gam.happy),training_set,nfold=10)
## GAMscale CV-mse-GAM
## 0 0
The CV MSE of this model is essentially zero, consistent with the result from the polynomial regression. I then test the quality of the additive model on the test set.
pred_add = predict(gam.happy, newdata = test_set)
Actual_add <- as.data.frame(cbind(Prediction = pred_add, Actual = test_set$Happiness_Score))
gg.add <- ggplot(Actual_add, aes(Actual, Prediction )) +
geom_point() + theme_bw() + geom_abline() +
labs(title = "Additive Model", x = "Actual happiness score",
y = "Predicted happiness score") +
theme(plot.title = element_text(family = "Helvetica", face = "bold", size = (15)),
axis.title = element_text(family = "Helvetica", size = (10)))
gg.add
As expected, the additive model fits the dataset almost perfectly.
Fitting SVR to the training set and checking the quality of the regression on the test set.
reg_svr = svm(formula = Happiness_Score ~ .,data = training_set,
type = 'eps-regression',kernel = 'radial')
pred_svr = predict(reg_svr, newdata = test_set)
Actual_svr <- as.data.frame(cbind(Prediction = pred_svr, Actual = test_set$Happiness_Score))
gg.svr <- ggplot(Actual_svr, aes(Actual, Prediction )) +
geom_point() + theme_bw() + geom_abline() +
labs(title = "SVR", x = "Actual happiness score",
y = "Predicted happiness score") +
theme(plot.title = element_text(family = "Helvetica", face = "bold", size = (15)),
axis.title = element_text(family = "Helvetica", size = (10)))
gg.svr
The plot indicates that SVR does an excellent job on this dataset.
Fitting random forest regression to the training set and checking the quality of the regression on the test set.
set.seed(1234)
reg_rf = randomForest(x = training_set[-1],
y = training_set$Happiness_Score,
ntree = 500)
pred_rf = predict(reg_rf, newdata = test_set)
Actual_rf <- as.data.frame(cbind(Prediction = pred_rf, Actual = test_set$Happiness_Score))
gg.rf <- ggplot(Actual_rf, aes(Actual, Prediction )) +
geom_point() + theme_bw() + geom_abline() +
labs(title = "Random Forest Regression", x = "Actual happiness score",
y = "Predicted happiness score") +
theme(plot.title = element_text(family = "Helvetica", face = "bold", size = (15)),
axis.title = element_text(family = "Helvetica", size = (10)))
gg.rf
The plot shows that random forest regression is not as accurate as SVR or multiple linear regression at predicting happiness scores.
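Random forests average many trees fit on bootstrap resamples, which smooths predictions but cannot reproduce an exact additive identity, consistent with the weaker fit seen here. The underlying bagging idea can be sketched in base R with any base learner (here a polynomial fit on simulated data):

```r
set.seed(3)
n <- 100
d <- data.frame(x = runif(n))
d$y <- sin(2 * pi * d$x) + rnorm(n, sd = 0.2)

# Fit the same model on 25 bootstrap resamples, then average the
# predictions -- the core of bagging, which random forests extend
# with random feature selection at each tree split.
boot_preds <- replicate(25, {
  b <- d[sample(n, replace = TRUE), ]
  fit <- lm(y ~ poly(x, 5), data = b)
  predict(fit, newdata = d)
})
bagged <- rowMeans(boot_preds)
```

Averaging reduces the variance of the individual fits at the cost of some bias, which is exactly the trade-off visible in the random forest's scatter plot.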
Finally, I fit a neural network to the training set and check the quality of the fit on the test set.
nn <- neuralnet(Happiness_Score ~ Economy + Family + Health + Freedom + Generosity + Trust + Dystopia_Residual,
                data = training_set, hidden = 10, linear.output = TRUE)
pred_nn <- compute(nn,test_set[,2:8])
Pred_Actual_nn <- as.data.frame(cbind(Prediction = pred_nn$net.result, Actual = test_set$Happiness_Score))
gg.nn <- ggplot(Pred_Actual_nn, aes(Actual, V1 )) +
geom_point() + theme_bw() + geom_abline() +
labs(title = "Neural Net", x = "Actual happiness score",
y = "Predicted happiness score") +
theme(plot.title = element_text(family = "Helvetica", face = "bold", size = (15)),
axis.title = element_text(family = "Helvetica", size = (10)))
gg.nn
The plot shows that the neural network is the best predictor after multiple linear regression.
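A common pitfall with neuralnet is unscaled inputs: training can fail to converge when predictors live on very different scales. The happiness factors here all lie in a narrow range, which is likely why the raw fit works, but a min-max rescaling step is cheap insurance (sketched below on a toy data frame; the column names are illustrative):

```r
# Min-max scale every column to [0, 1]; keep min/max to invert
# the transform on predictions afterwards.
minmax <- function(x) (x - min(x)) / (max(x) - min(x))

d <- data.frame(a = c(10, 20, 40), b = c(0.1, 0.5, 0.9))
scaled <- as.data.frame(lapply(d, minmax))
```

With the report's data one would scale `training_set` and `test_set` together (or scale the test set with the training set's min/max) before calling `neuralnet`.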
pred_poly = predict(reg_poly, newdata = test_set)
Actual_poly <- as.data.frame(cbind(Prediction = pred_poly, Actual = test_set$Happiness_Score))
gg.poly <- ggplot(Actual_poly, aes(Actual, Prediction )) +
geom_point() + theme_bw() + geom_abline() +
labs(title = "Polynomial regression", x = "Actual happiness score",
y = "Predicted happiness score") +
theme(plot.title = element_text(family = "Helvetica", face = "bold", size = (15)),
axis.title = element_text(family = "Helvetica", size = (10)))
gg.poly
ggarrange(gg.lm, gg.poly, gg.add, gg.svr, gg.rf, gg.nn, ncol = 2, nrow = 3)
MSE.lm <- sum((test_set$Happiness_Score - pred_lm)^2)/nrow(test_set)
MSE.poly <- sum((test_set$Happiness_Score - pred_poly)^2)/nrow(test_set)
MSE.add <- sum((test_set$Happiness_Score - pred_add)^2)/nrow(test_set)
MSE.svr <- sum((test_set$Happiness_Score - pred_svr)^2)/nrow(test_set)
MSE.rf <- sum((test_set$Happiness_Score - pred_rf)^2)/nrow(test_set)
MSE.nn <- sum((test_set$Happiness_Score - pred_nn$net.result)^2)/nrow(test_set)
print(paste("Mean Squared Error (Multiple Linear Regression):", MSE.lm))
## [1] "Mean Squared Error (Multiple Linear Regression): 0.0000000802962922642319"
print(paste("Mean Squared Error (Polynomial Regression):", MSE.poly))
## [1] "Mean Squared Error (Polynomial Regression): 0.0000000987572476581496"
print(paste("Mean Squared Error (Additive Model):", MSE.add))
print(paste("Mean Squared Error (SVR):", MSE.svr))
## [1] "Mean Squared Error (SVR): 0.0182972257535053"
print(paste("Mean Squared Error (Random Forest):", MSE.rf))
print(paste("Mean Squared Error (Neural Net):", MSE.nn))
## [1] "Mean Squared Error (Neural Net): 0.000274136341683124"
According to this comparison, multiple linear regression, polynomial regression, the additive model, and the neural network performed best. The MSE values confirm that these four models predict approximately equally well, and slightly better than SVR and random forest.
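The repeated MSE expressions above can be collapsed into a small helper, which also makes it harder to pass the wrong variable to a `print` label by accident:

```r
# Mean squared error between actual and predicted vectors.
mse <- function(actual, predicted) mean((actual - predicted)^2)

mse(c(1, 2, 3), c(1, 2, 4))  # one miss of size 1 over 3 points -> 1/3
```

Each comparison then becomes e.g. `mse(test_set$Happiness_Score, pred_svr)`.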
library(huge)
wh_matrix <- data.matrix(wh_reg)
out = huge(wh_matrix, nlambda = 40, lambda.min.ratio = 0.4, method = "glasso")
## Conducting the graphical lasso (glasso) with lossless screening....in progress:0%
Conducting the graphical lasso (glasso)....done.
out
## Model: graphical lasso (glasso)
## Input: The Data Matrix
## Path length: 40
## Graph dimension: 8
## Sparsity level: 0 -----> 0.4285714286
plot(out)
The result confirms the expectation above that Economy is the variable most strongly related to the happiness score. As lambda decreases, Health also becomes significantly related to the happiness score.
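The graphical lasso estimates a sparse inverse covariance (precision) matrix; a zero off-diagonal entry means the two variables are conditionally independent given all the others. The unpenalised analogue can be computed in base R, which helps make sense of the sparsity path above (simulated data here; in practice one would use the report's variable matrix):

```r
set.seed(2)
X <- matrix(rnorm(300), ncol = 3)

# Invert the sample covariance to get the precision matrix, then
# convert it to partial correlations: pcor_ij = -P_ij / sqrt(P_ii P_jj).
P <- solve(cov(X))
pcor <- -cov2cor(P)
diag(pcor) <- 1
pcor  # entries near 0 are what glasso shrinks to exactly 0
```

The lasso penalty drives the small partial correlations to exact zeros, and lowering lambda along the path lets progressively weaker conditional dependencies (here, Health after Economy) enter the graph.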
After analysing data on global happiness levels from 2015 to 2017, compiled by the United Nations Sustainable Development Solutions Network, we were able to assess the impact of each factor in determining happiness.
According to the correlation plot and scatter plots, I also found that among the seven factors, Economy tends to have the greatest effect on happiness, with Health following close behind. Thus, if a country wants a higher happiness score in the future, it should focus most on its economy and the health of its citizens.
Besides, when exploring differences in happiness scores across regions, I found that people in Europe and North America are much happier than people in Africa. The happiest countries were located in Europe, particularly Scandinavia and Switzerland, while the least happy countries were located in Africa and the Middle East. This suggests that countries in close proximity, or those in the same region, often have similar living conditions and are thus affected by these factors similarly.
Furthermore, I dug deeper, trying to find a model that fits the happiness score formula well. Based on the results in this report, we can choose multiple linear regression, polynomial regression, the additive model, or the neural network to predict future happiness scores, since they all fit the dataset excellently.
Based on the findings in this report, we can focus on improving and maintaining the factors most strongly related to the happiness score, and governments can draw ideas that may help build a better future. In this way, nations can move toward the true happiness that humans have always strived for.